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Abstract 

We present a parallel algorithm for the undirected s-t min-cut problem with floating-point 
valued weights. Our overarching algorithm uses an iteratively reweighted least squares framework. 

This generates a sequence of Laplacian linear systems, which we solve using parallel matrix 
algorithms. Our overall implementation is up to 30-times faster than a serial solver when using 
128 cores. 

1 Introduction 

We consider the undirected s-t min-cut problem. Our goal is a practical, scalable, parallel algorithm 
for problems with hundreds of millions or billions of edges and with floating point weights on the 
edges. Additionally, we expect there to be a sequence of such s-t min-cut computations where the 
difference between successive problems is small. The motivation for such problems arises from a 
few recent applications including the Flowlmprove method to improve a graph partition [1] and 
the GraphCut method to segment high-resolution images and MRI scans [5, 14], Both of these 
applications are limited by the speed of current s-t min-cut solvers that can handle problems with 
floating point weights. We seek to accelerate such s-t min-cut computations. 

For the undirected s-t min-cut problem, we present a Parallel Iteratively Reweighted least squares 
Min-Cut solver, which we call PIRMCut for convenience. This algorithm draws its inspiration from 
the recent theoretical work on using Laplacians and electrical flows to solve max-flow/min-cut in 
undirected graphs [8, 18, 24], However, our exposition and derivation is entirely self-contained. 
There are three essential ingredients to our approach. The first essential ingredient is a variational 
representation of the £ i-minimization formulation of the undirected s-t min-cut (Section 2.1). 
This representation allows us to use the iteratively reweighted least squares (IRLS) method to 
generate a sequence of symmetric diagonally dominant linear systems whose solutions converge to an 
approximate solution (Theorem 2.6). We show that these systems are equivalent to electrical flows 
computation (Proposition 2.3). We also prove a Cheeger-type inequality that relates an undirected 
s-t min-cut to a generalized eigenvalue problem (Theorem 2.7). The second essential ingredient is 
a parallel implementation of the IRLS algorithm using a parallel linear system solver. The third 
essential ingredient is a two-level rounding procedure that uses information from the electrical flow 
solution to generate a smaller s-t min-cut problem suitable for sequential s-t min-cut solvers. 

We have implemented PIRMCut on a distributed memory platform and evaluated its performance 
on a set of test problems from road networks and MRI scans. Our solver, PIRMCut, is 30 times 
faster (using 128 cores) than a state-of-the-art sequential s-t min-cut solver on a large road network. 
In the experimental results, we also demonstrate the benefit of using warm starts when solving 
a sequence of related linear systems. We further show the advantage of the proposed two-level 
rounding procedure over the standard sweep cut in producing better approximate solutions. 
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At the moment, we do not have a precise and graph-size based runtime bound on PIRMCut. 
We also acknowledge that, like most numerical solvers, it is only up to ^-accurate. The focus of this 
initial manuscript is investigating and documenting a set of techniques that are principled and could 
lead to practically fast solutions on real world s-t min-cut problems. We compare our approach 
with those of others and discuss some further opportunities of our approach in the related work and 
discussions (Sections 4, 6). 

2 An IRLS algorithm for undirected s-t min-cut 

In this section, we describe the derivation of the IRLS algorithm for the undirected s-t min-cut 
problem. We first introduce our notations. Let Q = (V,£) be a weighted, undirected graph. Let 
n = |V|, and m = \£\. We require for each undirected edge {u,v} G £, its weight c({u,v}) > 0. Let 
s and t be two distinguished nodes in Q, and we call s the source node and t the sink node. The 
problem of undirected s-t min-cut is to find a partition of V = S U S with s£5 and t G S such 
that the cut value 

cut (5,5)= c({u,v}) 

{u,v}g£ ,u£S ,v£S 

is minimized. For the interest of solving the undirected s-t min-cut problem, we assume Q to be 
connected. We call the subgraph of Q induced on V\{s,f} the non-terminal graph, and denote it by 
Q = (V,£). We call the edges incident to s or t the terminal edges, and denote them by £ T . 

The undirected s-t min-cut problem can be formulated as an -minimization problem. Let 
B G { — 1,0, iy nxn be the edge-node incidence matrix corresponding to an arbitrary orientation of 
G's edges, and C be the m x m diagonal matrix with c({u, u}) on the main diagonal. Further let 
f = [1 0] , and <F = [e s e t ] where e s (e*) is the s-th (t- th) standard basis, then the undirected s-t 
min-cut problem is 

minimize IICBxIL 

x 11 111 (1) 

subject to <hx = f, x G [0, l] n . 

2.1 The IRLS algorithm 

The IRLS algorithm for solving the £i -minimization problem (1) is motivated by the variational 
representation of the l\ norm 

Mi = min w>o \ Ei + w ») 

Using this variational representation, we can rewrite the -minimization problem (1) as the following 
joint minimization problem 

minimize R(x, w) = - W A (2) 

x,w>0 2 z —' V Wj / 

i 

subject to 4>x = f , x G [0, l] n (3) 

The IRLS algorithm for (1) can be derived by applying alternating minimization to (2) [2]. Given 
an IRLS iterate x^ -1 ^ satisfying constraint (3), the next IRLS iterate x® is defined according to 
the following two alternating steps: 

• Step 1. Compute the reweighting vector w®, of which each component is defined by 

w f } = CBxa-D)f + e2 (4) 

where e > 0 is a smoothing parameter that guards against the case of (CBx^^)j = 0. 
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Step 2. Let W® = diag( w^)), update by solving the weighted least squares (WLS) 
problem 

minimize -x r B T C(W ( ' ) )- 1 CBx subject to $x = f. (5) 


We prove that the sequence of iterates from the IRLS algorithm is well-defined, that is, each 
iterate x® exists and satisfies x^ G [0, l] n . Solving (5) leads to solving the saddle point problem 

' L« $ T 1 [ x 1 [ 0 1 , 

4 O A = f (6) 


Regarding the nonsingularity of the linear system (6), we have the following proposition. 

Proposition 2.1 A sufficient and necessary condition for the linear system (6) to be nonsingular 
is that Q is connected. 

Proof When Q is connected, ker(L^) = spanje}, where e is the all-one vector. From the definition 
of 4>, we know e 0 ker(<l>), i.e., ker(L^) n ker(<l>) = {0}. With $ being full rank, we apply Theorem 
3.2 from [3] to prove the proposition. | 

We now prove that the solution x® to (6) automatically satisfies the interval constraint. 

Proposition 2.2 If Q is connected, then for each IRLS update we have x® G [0, l\ n . 

Proof According to proposition 2.1, the linear system (6) is nonsingular when Q is connected. We 
apply the null space method [3] to solve it. Let Z be the matrix from deleting the s, t columns of 
the identity matrix I n , then Z is a null basis of <f>. Because d>e s = f, using Z and e s we reduce (6) 
to the following linear system 

(Z T L (Z) Z)v = -Z T L w e s (7) 

where v is the vector by deleting the s and t components from x. Note = — Z T L^e s > 0 
because L^ is a Laplacian. We call L^ = Z L^Z the reduced Laplacian, and (1W)) _1 > 0 because 
it’s a Stieltjes matrix (see Corollary 3.24 of [31]). Thus v® = (L®) _1 b® > 0. Also note that 
L^e > b®, thus L^(e — v®) = L^e — b® > 0. Using (L®) -1 > 0 again, we have e — > 0. 

In summary, we have 0 < v® < e. § 

Now define 

L^ = B T C(W W ) _1 CB (8) 

we note that L® is a reweighted Laplacian of Q, where the edge weights are reweighted by the 
diagonal matrix (W^^) _1 . We also define the reduced Laplacian to be L*-^ = Z L^Z, in which Z 
is the matrix from deleting the s,t columns of the identity matrix I n . 

In applying the IRLS algorithm to solve the undirected s-t min-cut problem, we start with an 
initial vector xO satisfying (3). We take x^°) to be the solution to (5) with = C. Then we 
generate the IRLS iterates x^ for l = 1,..., T using (4) and (5) alternatingly. Finally we apply 
a rounding procedure on x (i > to get a binary cut indicator vector. We discuss the details of the 
rounding procedure in Section 3.4. Essentially what the IRLS algorithm does is solving a sequence of 
reduced Laplacian systems for l = 1,... ,T. It is already known that undirected max-flow/min-cut 
can be approximated by solving a sequence of electrical flow problems [8]. We make the connection 
between the IRLS and the electrical flow explicit by proposition 2.3. Because of this connection 
between IRLS and the electrical flow, we also call x^d the voltage vector in the following. 
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Proposition 2.3 The WLS (5) solves an electrical flow problem. Its flow value is (xW) T LWxW. 

Proof Let x® be the solution to (5) and z® = C(W®) _1 CBx®. From |(x®) T L®x® = 
^(z®) T C _1 W®C _1 zW, z® minimizes the energy function E( z) = ^z T C X W®C x z. We now 
prove that z® satisfies the flow conservation constraint. 

B T z W = B T C(W®) _1 CBx® = L®x® 


where the last equality is by the first equation in (6). From <f> T = [e s e*] , we have (B T z®) u = 0 
when u s,t. Also note that e T B T z® = 0. Thus z® is an electrical flow. From x.® = 1 and 
x® = 0, the flow value of z® is given by 

H{ z®) = (x®) T B T z® = (x®) T B T C(W®) _ 1 CBx® 

= ( x ®) T L®x® | 


2.2 Convergence of the IRLS algorithm 

We now apply a general proof of convergence from Beck [2] to our case. Because of the smoothing 
parameter e introduced in defining the reweighting vector w® as in (4), the IRLS algorithm is 
effectively minimizing a smoothed version of the -minimization problem (1) defined by 


minimize S e (x) = YaLi \J (CBx)? + 
subject to <Lx = f, x £ [0, l] n . 


Using results from Beck [2], we have the following nonasymptotic and asymptotic convergence 
rates of the IRLS algorithm. 

Theorem 2.4 (Theorem 4.1 of [2]) Let {x®}/>o be the sequence generated by the IRLS method 
with smoothing parameter e. Then for any T > 2 


( T-l 

SJx <T> ) - S e * < max | f M (Sfly 




<T- 1) 


where S e * 


is the optimal value of (9), and R is the diameter of the level set of (2) (see (3.7) in [2]). 


Theorem 2.5 (Theorem 4.2 of [2]) Let {x®}j> 0 be the sequence generated by the IRLS method 
with smoothing parameter e. Then there exists K > 0 such that for allT > K + 1 


- S e * < 


48 R 2 
<T-K) 


Regarding the convergence property of the iterates {x®};>o, we have the following theorem. 

Theorem 2.6 (Lemma 4.3 of [2]) Let {x®};> 0 be the sequence generated by the IRLS method 
with smoothing parameter e. Then 

(i) any accumulation point o/{x®}/>o is an optimal solution of (9). 

(ii) for any i € {!,... ,m}, the sequence {|(CBx®)j|}/> 0 converges. 
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2.3 A Cheeger-type inequality 

In Section 2.1 we have shown that the IRLS algorithm relates the undirected s-t min-cut problem 
to solving a sequence of reduced Laplacian systems. Here we prove a Cheeger-type inequality that 
characterizes the undirected s-t min-cut by the second finite generalized eigenvalue of a particular 
matrix pencil. We start with some notations and definitions. Let C = 2 J2{ u v}e£ c (i u i ^}) be., twice 
the total edge weights of Q. We define a weight function d(-) on V to be 


d(u) 


C u = s or u = t 
0 otherwise 


( 10 ) 


The volume of a subset S C V is defined by vol(«S) = Ylues d(u). As usual we define 0(5) = 
min{voi(!^)’voi(g)} ’ an d the expansion parameter to be 0 = min^y 0(5). We have 0(5) < oo if and 
only if (5,5) is an s-t cut. And 0(5) = 0 if and only if (5,5) is an s-t min-cut. Let D be the 
n x n matrix with its main diagonal being d, and let L be the Laplacian of Q . We consider the 
generalized eigenvalue problem of the matrix pencil (L, D) 

Lx = ADx (11) 

Because D is of rank 2, the pencil (L, D) has only two finite generalized eigenvalues (see Definition 

3.3.1 in [30]). We denote them by Ai < A 2 . We state a Cheeger-type inequality for undirected s-t 
min-cut as follows. 

Theorem 2.7 Let 0 be the min-cut and let A 2 be the non-zero eigenvalue of (11), then < A 2 < 
20. 

We defer this proof to the appendix (Section A.l) because it is a straightforward generalization of 
the proof of Theorem 1 in [9]. It also follows from a recent generalized Cheeger inequality by [22]. 

3 Parallel implementation of the IRLS algorithm 

In this section, we describe our parallel implementation of the IRLS algorithm. In addition, we also 
detail the rounding procedure we use to obtain an approximate s-t min-cut from the voltage vector 
x^ output from the IRLS iterations. 

3.1 A parallel iterative solver for the sequence of reduced Laplacian systems 

The main focus of the IRLS algorithm is to solve a sequence of reduced Laplacian systems for 
l = 1 ,...,T. We keep L® = Z T L^Z and set = — Z T L^e s in the following; note that the 
system L^v = is equivalent to (5) as explained in the proof to Proposition 2.2. Because L® is 
symmetric positive definite, we can use the preconditioned conjugate gradient (PCG) algorithm 
to solve it in parallel [26]. For preconditioning, we choose to use the block Jacobi preconditioirer 
because of its high parallelism. Specifically, we extract from L^^ a p x p block diagonal matrix 
M® = diag(M^,.,,, M®), where p is the number of processes. We use M® to precondition 
solving the linear system of L^b In each preconditioning step of the PCG iteration, we need to solve 
a linear system involving Because of the block diagonal structure of M*^, the p independent 

subsystems 

= rj j = l,...,p (12) 

can be solved in parallel. We consider two strategies to solve the subsystems (12): (I) LU factorization 
to exactly solve (12); and (II) incomplete LU factorization with zero fill-in (ILU(O)) to approximately 
solve (12). Given that the nonzero structure of L^^ never changes (it’s fully determined by the 
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non-terminal graph Q), the symbolic factorization of LU and ILU(O) needs to be done only once. 
As we will demonstrate in Section 5, applying (I) or (II) highly depends on the application. Given 
that we are solving a sequence of related linear systems, where the solution v® is used to define 
P+ 1 ) and b^ +1 \ we adopt a warm start heuristic. Specifically, we let v® be the initial guess to 
the parallel PCG iterations for solving the (l + l)-th linear system. In contrast, cold start involves 
always using the zero initial guess. 

3.2 Parallel graph partitioning 

Given a fixed number of processes p, we consider the problem of extracting from an effective and 
efficient block Jacobi preconditioner For this purpose, we impose the following two objectives: 

(i) The Frobenius norm ratio ||M^||f/||L (/) ||f is as large as possible. 

(ii) The sizes and number of nonzeros in M^, j = 1 ,p are well balanced. 

The above two objectives can be formulated as a k- way graph partitioning problem [16, 17]. Let 
be the reweighted non-terminal graph whose edge weights are determined by the off-diagonal entries 
of L (/ b Ideally we need to apply the k- way graph partitioning to for each l = 1,... ,T. However, 
according to (8) we observe that large edge weights tend to be large after reweighting because of 
the C 2 in (8). This motivates us to reuse graph partitioning. Specifically, we use the parallel graph 
partitioning package ParMETIS 1 [16] to partition the non-terminal graph Q into p components 
with the objective of minimizing the weighted edge cut, while balancing the volumes of different 
components. We then use this graph partitioning result for all the following IRLS iterations. The 
graph partitioning result implies a reordering of the nodes in V\{s, t} such that nodes belonging to 
the same component are numbered continuously. Let the permutation matrix of this reordering be 
P. We then extract M^) as the block diagonal submatrix of the reordered matrix PL®P . 

3.3 Graph distribution and parallel graph reweighting 

The input graph Q consists of the non-terminal graph Q and the terminal edges £^. Let L be the 
graph Laplacian of Q. For distributing Q among p processes, we partition the reordered matrix 

— T 

PLP into p block rows and assign each block row to one process. The graph partitioning also 

p 

induces a partition of the terminal edges as £ r = [J £j. and each £j is assigned to one process. 

3 =i 

Accordingly we also partition and distribute the permuted vectors Px® and Pb® among p processes. 
At the Z-th IRLS iteration, we need to form the reweighted Laplacian (8) using the voltage vector 
Px^" 1 ). After the p processes have communicated to acquire their needed components of Px( i_1 \ 
they can form their local block row of PL Wp T and Pb^ in parallel. Thus the cost of one parallel 
graph reweighting is equivalent to that of one parallel matrix-vector multiplication. Note that the 
k-w&y graph partitioning usually results in a small number of edges between different components. 
This characteristic helps to reduce the process communication cost. 

3.4 A two-level rounding procedure 

Let x- 7 '- 1 be the voltage vector output from running the IRLS algorithm for T iterations. We consider 
the problem of converting x- 7 ) to a binary cut indicator vector. A standard technique for this 
purpose is sweep cut [32], which is based on sorting the nodes according to x 77 \ Here we propose a 
heuristic rounding procedure, which we refer to as the two-level rounding procedure. Empirically we 
observe the components of x^ tend to converge to {0,1} as l gets larger. We call this effect the 

: http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview 
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node voltage polarization. This observation motivates the idea of coarsening the graph Q using xO. 
Given and another two parameters 70 and 71 , we partition V = So U U S c as follows: 

So, if xi T) < 70 

u <E < Si, if xi T) > 71 
(T) 

, S c , if 70 < x u < 71 

(T) 

Intuitively, we coalesce node u with the sink (source) node if x„ ; is small (large) enough, and 
otherwise we leave it alone. Contracting So (Si) to a single node t c (s c ), we define a coarsened 
graph Q c = (V c , £ c ) with node set V c = {s c , t c } U S c . 

The weighted edge set £ c of the coarsened graph is defined according to the following cases 

• {u,v} G £ c with weight c({i,j}) if u,v G S c such that {u,v} G £. 

• {s c ,u} G £c with weight )T, ;e<Si c({u,u}). 

• {t c , u} G £ c with weight J2 v eS 0 c ({ n > u })- 

• {s c ,t c } G £ c with weight £ ueiSl E„ e s 0 c (( u > u })- 

An s-t cut on Q c induces an s-t cut on Q. The idea of the two-level rounding procedure is 
to solve the undirected s-t min-cut problem on the coarsened graph Q c using a combinatorial 
max-flow/min-cut solver, e.g., [ 6 ]; and then get the corresponding s-t cut on Q. Regarding the 
quality of such an acquired s-t cut on 5 , we have the following proposition. 

Proposition 3.1 Let V = So U Si U S c be the partition generated by coarsening using x (I ) and 
parameters 70 , 71 - If there is an s-t min-cut on Q such that Si is contained in the source side, and 
So is contained in the sink side, then an s-t min-cut on Q c induces an s-t min-cut on Q. 

Choosing the values of 70 and 71 is critical to the success of the two-level rounding procedure. On 
the one hand, we want to set their values conservatively so that not many nodes in So and Si are 
coalesced to the wrong side. On the other hand, we want to set their values aggressively so that the 
size of Q c is much smaller than Q. We find a good trade off is achieved by clustering the components 
of x 1 ' 1 f Let the two centers returned from K -means on x 1 ' 1 - 1 (with initial guesses 0.1 and 0.9) be Co 
and ci, then we let 70 = Co + 0.05 and 71 = ci — 0.05. 

Algorithm 1 PIRMCut ( Q,s,t,p,T ) 

Require: A weighted and undirected graph Q = (V,S), the source node s, the sink node t, the 
number of processes p, and the number of IRLS iterations T. 

Ensure: An approximate s-t min-cut on Q. 

1 : Let Q be the non-terminal graph and £^ be the terminal edges. Use ParMETIS to partition Q 
into p components. Reorder and distribute Q and £ T to p processes accordingly. 

2 : Initialize x^ to be the solution to (5) with W® = C using the parallel PCG solver. 

3: for l = 1 ,,T do 

4: Execute the parallel IRLS algorithm by alternating between parallel graph reweighting and 

solving the reduced Laplacian system L^v = using the parallel PCG solver. 

5: end for 

6 : Gather the voltage vector x . 1 ' 1 ) to the root process. 

7: The root process applies a rounding procedure on x^ to get an approximate s-t min-cut on Q. 

We summarize the algorithm PIRMCut in Algorithm 1. Note the main sequential part of 
PIRMCut is the rounding procedure (either sweep cut or our two-level rounding). 
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4 Related work 


Approaches to s-t min-cut based on £ 1 -minimization are common. Bhusnurmath and Taylor [4], 
for example, transform the £ 1 -minimization problem to its linear programming (LP) formulation, 
and then solve the LP using an interior point method with logarithmic barrier functions. Similar 
to our IRLS approach, their interior point method also leads to solving a sequence of Laplacian 
systems. In that vein, [8] develops the then theoretically fastest approximation algorithms for s-t 
max-flow/min-cut problems in undirected graphs. Their algorithm involved a sequence of electrical 
flow problems defined via a multiplicative weight update. We are still trying to formalize the 
relation between our IRLS update and the multiplicative weight update used in [8]. More recent 
progress of employing electrical flows to approximately solve the undirected s-t max-flow/min-cut 
problems to achieve better complexity bounds include [18, 24]. In contrast, the main purpose of 
this manuscript is to implement a parallel s-t min-cut solver using this Laplacian paradigm, and 
evaluate its performance on a parallel computing platform. Beyond the electrical flow paradigm, 
which always satisfies the flow conservation constraint but relaxes the capacity constraint, the fastest 
combinatorial algorithms, like push-relabel and its variants [7, 13] and pseudoflow [15], relaxes the 
flow conservation constraint but satisfying the capacity constraint. However, parallelizing these 
methods is difficult. 

Recent advances on nearly linear time solver for Laplacian and symmetric diagonally dominant 
linear systems are [19, 21, 25]. We do not incorporate the current available implementations of 
[21] and [25] into PIRMCut because they are sequential. Instead we choose the block Jacobi 
preconditioned PCG because of its high parallelism. We notice an idea similar to our two-level 
rounding procedure that is called flow-based rounding has been proposed in [23]. This idea was later 
rigorously formalized as the Flowlmprove algorithm [1]. 

5 Experimental results 

In this section, we describe experiments with PIRMCut on several real world graphs. These 
experiments serve to empirically demonstrate the properties of PIRMCut, and also evaluate its 
performance on large scale undirected s-t min-cut problems. In the following experiments, we 
only consider floating-point valued instances because most interesting applications of undirected 
s-t min-cut produce floating-point valued problems, e.g., Flowlmprove [1], image segmentation 
[5], MRI analysis [14], and energy minimization in MRF [20]. Thus, we do not include those 
state-of-the-art max-flow/min-cut solvers for integer valued problems in the following experiments, 
like hipr-v3.4 2 [7] and pseudo-max-3.23 3 [15]. We would have liked to compare different parallel 
max-flow/min-cut solvers [12, 27, 29] regarding their performance and parallel scalability, but a 
detailed, fair comparison is beyond the scope of this extended abstract. 

5.1 Data Sets 

The real world graphs we use to evaluate PIRMCut are from two sources: road networks and N-D 
grid graphs. The first three road networks in Table 1 are from the University of Florida Sparse 
Matrix collection [11], From these road graphs, we generate their undirected s-t min-cut instances 
using the Flowlmprove algorithm [1] starting from a geometric bisection. The other three N-D grid 
graphs in Table 1 are the segmentation problem instances from the University of Western Ontario 
max-flow datasets 4 [5], on which we convert the edge weights to be floating-point valued by adding 
uniform random numbers in [0,1]. 

2 http://www.igsystems.com/hipr/download.html 

3 http://riot.ieor.berkeley.edu/Applications/Pseudoflow 

4 http://vision.csd.uwo.ca/maxflow-data 



Table 1: All graphs are undirected and we report the size of the non-terminal graph. 


Graph 

|V| 

\£\ 

usroads-48 

126,146 

161,950 

asia_osm 

11,950,757 

12,711,603 

euro_osm 

50,912,018 

54,054,660 

adhead.n26cl00 

12,582,912 

163,577,856 

bone.n26cl00 

7,798,784 

101,384,192 

liver.n26cl00 

4,161,600 

54,100,800 



Figure 1: The number of PCG iterations to reach a residual of 10 3 is reduced by roughly 20% by 
using warm starts for this sequence of Laplacian systems. 

5.2 The effect of warm starts 

On the graph of usroads-48, we demonstrate the benefit of the warm start heuristic described in 
Section 3.1. We set the smoothing parameter e = 10 ~ 6 and run the 1RLS algorithm for 50 iterations 
on 4 MP1 processes. For each 1RLS iterate, we set the maximum number of PCG iterations to be 
300, and the stopping criterion in relative residual to be 10 -3 . In Figure 1 we plot the number of 
PCG iterations of using warm starts and cold starts, ft is apparent that for most IRLS iterates, 
warm starts help to reduce the number of needed PCG iterations significantly, especially for later 
IRLS iterates. Another interesting phenomenon we observe from Figure 1 is that the difficulty of 
solving the reduced Laplacian system increases dramatically during the first several IRLS iterates, 
and then decreases later on. A possible explanation is that the IRLS algorithm makes faster progress 
during the early iterates. 

5.3 Effect of node voltage polarization 

Continuing with usroads-48 as an example, we demonstrate node voltage polarization, which 
motivates the idea of two-level rounding procedure (see Section 3.4). We plot a heatmap of x® 
(after sorting its components) for l = 0,..., 50 in Figure 2. ft is apparent that the polarization gets 
emphasized as l becomes larger. We also see from Figure 2 that the values of change quickly for 
l < 10, which reinforces the observation of the IRLS algorithm making faster progress early. 
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Figure 2: Node voltage polarization. Each row is a sorted voltage vector 


5.4 Performance evaluation on large graphs 

We evaluate the empirical performance of PIRMCut on the five largest graphs in Table 1. Our parallel 
implementation of PIRMCut is written in C++, and it is purely based on MPI, no multi-threading 
is exploited. In particular, the parallel PCG solver with block Jacobi preconditioner is implemented 
using the PETSc 5 package [28]. All the experiments are conducted on a cluster machine with a total 
number of 192 Intel Xeon E5-4617 processors (8 nodes each with 24 cores). All the results reported 
in the following are generated using: smoothing parameter e = 10 —6 , number of IRLS iterations 
T = 50, warm start heuristic, and 50 PCG iterations at maximum. And on the road networks, 
we use exact LU factorization by UMFPACK [10] to solve the subsystems (12), while on the N-D 
grid graphs we use ILU(O) in PETSc [28]. For implementing the two-level rounding procedure in 
PIRMCut we use the Boykov-Kolmogorov solver 6 [6], which is a state-of-the-art max-flow solver 
that can handle floating-point edge weights. We refer to it as the B-K Solver in the following. 




Figure 3: Parallel scalability of the IRLS iterations of PIRMCut on four large graphs. 

We first study the parallel scalability of the IRLS iterations. The timing results of parallel IRLS 
iterations on four graphs are plotted in Figure 3 7 . In Figure 3, we see the speedup starts is initially 
superlinear. This happens because, as p gets large, the total work of applying the block Jacobi 
preconditioner decreases. On asia.osm, we can achieve linear speedup until p = 128. On two of 
three N-D grid graphs, the linear speedup stops after p = 64. One reason is that the N-D grid 
graphs are denser than the road networks (see Table 1), which would incur high communication 

’http://www.mcs.anl.gov/petsc/ 

°http://pub.ist.ac.at/~vnk/software.html 

7 The result on euro_asm is not plotted because its time on 8 cores will dwarf the others. 
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cost when p is large. Another reason is that the use of ILU(O) on N-D grid graphs makes the work 
reduction not that critical after p is large enough. 

In Table 2 we show the time of different phases of PIRMCut for the given number of cores shown 
in the last column. We also show in the second to last column the size reduction ratio achieved 
during the two-level rounding procedure. On the graphs of adhead.n26cl00 and bone .n26cl00, 
the size reduction is so dramatic that the time taken by the two-level rounding procedure is even 
less than that of sweep cut on the original graph. However, on the graph of liver .n26cl00, the 
size reduction is not effective and the time of the sequential two-level rounding procedure is even 
much more than that of IRLS iterations. 


Table 2: Time of different phases of PIRMCut. All the times are in seconds (rounded to integers). 
The second to last column shows the size reduction ratio of the two-level graph coarsening. 


Graph 

Graph Partition 

IRLS 

Sweep Cut 

Two-level 

|V|/|V C | 

Number of cores 

asia.osm 

5 

28 

3 

16 

80.5 

128 

eurcuosm 

7 

185 

17 

109 

36.2 

128 

adhead.n26cl00 

1 

172 

6 

1 

432.4 

64 

bone,n26cl00 

1 

99 

4 

1 

376.9 

64 

liver.n26cl00 

1 

25 

2 

69 

10.1 

64 


Table 3: Time comparison between PIRMCut and B-K Solver. All the times are in seconds (rounded 
to integers). We use the time of two-level for getting the total time of PIRMCut. 


Graph 

PIRMCut 

B-K Solver 

Speedup 

Number of Cores 

asia_osm 

49 

1465 

29.8 

128 

euro_osm 

301 

3102 

10.3 

128 

adhead.n26cl00 

174 

555 

3.2 

64 

bone.n26cl00 

101 

215 

2.1 

64 

liver.n26cl00 

95 

294 

3.1 

64 


Finally, we compare the performance of PIRMCut with that of B-K Solver. The B-K Solver 
is a sequential code, and we run it on one Intel Xeon E5-4617 processor. The total time taken 
by PIRMCut and B-K Solver respectively on the five large graphs are shown in Table 3. On the 
road networks, our PIRMCut achieves desirable speedup using 128 cores. Especially on asia_osm, 
PIRMCut is almost 30-times faster than B-K Solver. In contrast, we find B-K Solver to be really 
efficient on the N-D grid graphs. It is interesting that even though the road networks are planar 
and sparser, they seem to be much harder than the N-D grid graphs for the B-K Solver. 

We then use the output of the B-K Solver to evaluate the quality of the approximate s-t min-cut 
achieved by sweep cut and the two-level rounding procedure. Specifically, we denote by p* the s-t 

Table 4: Solution quality of sweep cut and two-level. The right two columns are the relative 
approximation ratio 6. 


Graph 

Sweep Cut 

Two-level 

asia_osm 

2.1 x 10“ 3 

3.3 x 10“ 5 

euro.osm 

3.2 x 10 -2 

6.8 x 10 -5 

adhead.n26cl00 

1 x 10" 4 

0 

bone.n26cl00 

9.7 x 10 -4 

0 

liver.n26cl00 

4.9 x 10“ 2 

8.8 x 10“ 4 
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min-cut value output from B-K Solver, and /j the s-t min-cut value output from PIRMCut. Then 
we compute the relative approximation ratio 

6 = (»-!?)/»* (13) 

As shown in Table 4, the two-level rounding procedure always produces a much better solution than 
sweep cut, which is a justification of it taking more time on three graphs in Table 2. 

6 Discussion 

The PIRMCut algorithm is a step towards our goal of a scalable, parallel s-t min-cut solver for 
problems with hundreds of millions or billions of edges. It has much to offer, and demonstrates good 
scalability and improvement over an expertly crafted serial solver. 

We have not yet discussed how to take advantage from solving a sequence of related s-t min-cut 
problems because we have not yet completed our investigation into that setting. That said, we have 
designed our framework such that solving a sequence of problems is reasonably straightforward. 
Note that all of the set up overhead of PIRMCut can be amortized when solving a sequence of 
problems if the graph structure and edge weights do not change dramatically. The larger issue 
with a sequence of problems, however, is that our solver only produces <5-accurate solutions. For 
the applications where a sequence is desired, we need to revisit the underlying methods and see if 
they can adapt to this type of approximate solutions. This setting also offers an opportunity to 
study the accuracy required for solving a sequence of problems. It is possible that by solving each 
individual problem to a lower accuracy, it may generate a sequence with a superior final objective 
value for application purposes. Thus, the goal of our future investigation on a sequence of problems 
will attempt to solve the sequence faster and better through carefully engineered approximations. 

There are also a few further investigations we plan to conduct regarding our framework. First, we 
used a distributed memory solver for the diagonally dominant Laplacians. Recent work on the nearly 
linear time solvers for such systems shows the potential for fast runtime, but the parallelization 
potential is not yet evident in the literature. We wish to explore the possibility of parallelizing these 
nearly linear time solvers as well as incorporating them into our framework. 

Second, we wish to make our two-level rounding procedure more rigorous. As we have mentioned, 
this method is similar to Lang’s flow-based rounding for the spectral method [23] that was later 
formalized in [1], We hypothesize that a similar formalization will hold here based on which we will 
be able to create a more principled approach. Furthermore, we suspect that a multi-level extension 
of this will also be necessary as we continue to work on larger and larger problems. 

Finally, we plan to continue working on improving the analysis of our PIRMCut algorithm in an 
attempt to formally bound its runtime. This will involve designing more principled stopping criteria 
based on tighter diagnostics, e.g., from applying our Cheeger-type inequality. 

References 

[1] R. Andersen and K. J. Lang. An algorithm for improving graph partitions. In Shang-Hua Teng, 
editor, ACM-SIAM Symposium on Discrete Algorithms , pages 651-660. SIAM, 2008. 

[2] A. Beck. On the convergence of alternating minimization with applications to iteratively 
reweighted least squares and decomposition schemes. Optimization Online , 2013. 

[3] M. Benzi, G. H. Golub, and J. Liesen. Numerical solution of saddle point problems. ACTA 
NUMERICA , 14:1-137, 2005. 

[4] A. Bhusnurmath and C. J. Taylor. Graph cuts via i\ norm minimization. IEEE Transactions 
on Pattern Analysis and Machine Intelligence , 30(10):1866-1871, 2008. 


12 



[5] Y. Boykov and G. Funka-Lea. Graph cuts and efficient N-D image segmentation. International 
Journal of Computer Vision , 70(2): 109-131, 2006. 

[6] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms 
for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine 
Intelligence , 26(9): 1124-1137, 2004. 

[7] B. V. Cherkassky and A. V. Goldberg. On implementing the push-relabel method for the 
maximum flow problem. Algorithmica, 19(4):390-410, 1997. 

[8] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S. Teng. Electrical flows, Laplacian 
systems, and faster approximation of maximum flow in undirected graphs. In STOC, pages 
273-282. ACM, 2011. 

[9] F. Chung. Four Cheeger-type inequalities for graph partitioning algorithms. In Proceedings of 
ICCM, 2007. 

[10] T. A. Davis. Algorithm 832: UMFPACK V4.3—an unsymmetric-pattern multifrontal method. 
ACM Transactions on Mathematical Software, 30(2)496-199, 2004. 

[11] T. A. Davis and Y. Hu. The university of Florida sparse matrix collection. ACM Transactions 
on Mathematical Software , 38(1)4, 2011. 

[12] A. Delong and Y. Boykov. A scalable graph-cut algorithm for N-D grids. In IEEE Conference 
on Computer Vision and Pattern Recognition. IEEE Computer Society, 2008. 

[13] Andrew V. Goldberg. Two-level push-relabel algorithm for the maximum flow problem. In 
Andrew V. Goldberg and Yunhong Zhou, editors, 44/M, volume 5564 of Lecture Notes in 
Computer Science, pages 212-225. Springer, 2009. 

[14] D. Hernando, P. Kellman, J. Haidar, and Z. Liang. Robust water/fat separation in the presence 
of large field inhomogeneities using a graph cut algorithm. Magnetic Resonance in Medicine, 
63(l):79-90, 2010. 

[15] D. S. Hochbaum. The pseudoflow algorithm: A new algorithm for the maximum flow problem. 
Operations Research, 58(4):992-1009, 2008. 

[16] G. Karypis and V. Kumar. A parallel algorithm for multilevel graph partitioning and sparse 
matrix ordering. Journal Parallel and Distributed Computing, 48(l):71-95, 1998. 

[17] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. 
SIAM Review, 41(2):278-300, 1999. 

[18] J. A. Kelner, Y. T. Lee, L. Orecchia, and A. Sidford. An almost-linear-time algorithm for 
approximate max flow in undirected graphs, and its multicommodity generalizations. In 
ACM-SIAM Symposium on Discrete Algorithms, pages 217-226. SIAM, 2014. 

[19] J. A. Kelner, L. Orecchia, A. Sidford, and Z. Zhu. A simple, combinatorial algorithm for solving 
SDD systems in nearly-linear time. In Proceedings of the forty-fifth annual ACM symposium 
on Theory of computing, pages 911-920. ACM, 2013. 

[20] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE 
Transactions on Pattern Analysis and Machine Intelligence, 26(2)447-159, 2004. 


13 



[21] I. Koutis, G. L. Miller, and D. Tolliver. Combinatorial preconditioners and multilevel solvers for 
problems in computer vision and image processing. Computer Vision and Image Understanding, 
115(12): 1638-1646, 2011. 

[22] Ioannis Koutis, Gary Miller, and Richard Peng. A generalized cheeger inequality. Personal 
communication and public presentation, 2014. 

[23] K. Lang. Fixing two weaknesses of the spectral method. In NIPS, 2005. 

[24] Y. T. Lee, S. Rao, and N. Srivastava. A new approach to computing maximum flows using 
electrical flows. In STOC, pages 755-764. ACM, 2013. 

[25] O. E. Livne and A. Brandt. Lean Algebraic Multigrid (LAMG): fast graph Laplacian solver. 
SIAM Journal on Scientific Computing, 34(4):B499-B522, 2012. 

[26] Y. Saad. Iterative methods for sparse linear systems. SIAM, 2003. 

[27] A. Shekhovtsov and V. Hlavac. A distributed mincut/maxflow algorithm combining path 
augmentation and push-relabel. International Journal of Computer Vision, 104(3) :315-342, 
2013. 

[28] B. Smith. PETSc (Portable, Extensible Toolkit for Scientific Computation). In D. A. Padua, 
editor, Encyclopedia of Parallel Computing, pages 1530-1539. Springer, 2011. 

[29] P. Strandmark and F. Kahl. Parallel and distributed graph cuts by dual decomposition. In 
IEEE Conference on Computer Vision and Pattern Recognition, pages 2085-2092. IEEE, 2010. 

[30] S. Toledo and H. Avron. Combinatorial preconditioners. In U. Naumann and O. Schenk, 
editors, Combinatorial Scientific Computing. Chapman & Hall/CRC, 2012. 

[31] R. S. Varga. Matrix iterative analysis, Second Edition. Springer, 2000. 

[32] N. Vishnoi. Lx = b. Laplacian solvers and their algorithmic applications. Foundations and 
Trends in Theoretical Computer Science, 8(1-2):1—141, 2012. 


A Appendix 

A.l Proof of the Cheeger-type inequality 

We first prove the following characterization of A 2 - 

Proposition A.l A 2 is the optimal value of the constrained WLS problem 

1 t 

mm ——-x Lx 

x 2 C 

s.t. x s = 1, x t = -1 


Proof By inspection, Ai = 0 with x = e. Thus, A 2 can be represented by 


x T Lx 

A 2 = min —^- 

e T Dx=0X Dx 


(15) 


T 

The constraint e Dx = 0 is equivalent to x s + x* = 0. Because the representation (15) is invariant 

T 

to scaling of x, we can constrain x s = 1 and xj = — 1, which results in x Dx = 2 C. jj 
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Note that (14) has a form similar to (5), where the main difference is that we use {1, —1} to encode 
the boundary condition. We now prove the Cheeger-type inequality. 

Theorem A.2 < A 2 < 2<j) 

Proof We first prove the direction of A 2 < 2 cf>. Let ( S,S ) be an s-t min-cut such that s G S and 
t G S. Then (j>(S) = (f>. We define the {1, —1} encoded cut indicator vector 

_ J 1 if u E S 

X “ { — 1 if u G S 

This vector satisfies e Dx = 0. Then by representation (15) 

x T Lx 4 cut(S,S) 

M = 2C 

= 2<f>(S) = 2 (j) 

L 2 

We then prove the direction of ^- < A 2 . We follow the proof technique and notations used in 
Theorem 1 of [9]. Let g denote the solution to (14). Using an argument similar to Proposition 2.2, 
we can show — 1 < g (u) < 1. We sort the nodes according to g such that 

1 = g(s) = g(ui) > • • • > g(v n ) = g(t) = -1 (16) 

Let = {'iq,..., Vi} for i = 1,..., n — 1 and we denote by vol(5 ? ) = min{vol(5j), vol(<Sj)} = C. We 
let a = min^T] 1 4>(Si). Generalizing the proof to Theorem 1 in [9] we can show 

A > (Eu~„ c({u, ^})(g+(^) 2 ~ g+(^) 2 )) 2 

2 “ (E u g+(^) 2 d(M))(E M ^c({u,u})(g+(«) + g + (u)) 2 ) 

To bound the denominator, we have 

c ({“’ y })(s+{u) + g+(u)) 2 

IL~V 

<2 u })(g+(“) 2 + g+(^) 2 ) < 2g+(s) 2 d(-s) 

where the last inequality follows from g+(u) < g+(s) and d(s) = C = 2 E u ~„ c({u, v})- Because 
g+(t) = 0 and according to the definition of (10) we have E u g+(^) 2 d(rt) = g+(s) 2 d(.s). By 
telescoping the term g + (u) 2 — g+(u) 2 using the ordering (16), we then have 

(EEi^g+K) 2 “ g+K+i) 2 )cut(5i,5i)) 


2 


2 (g + (s) 2 d(s)) 2 2 

where the last equality follows from E”=/(g+Ou) 2 — g+(i’i+i) 2 )vol(5,:) = g + (s) 2 d(s) because 
vol(5,:) = C = d(s) for i = 1,..., n — 1. From a > <f> we have A 2 > | 


2(g+(s) 2 d(s)) 2 

a 2 (EEi^g+K) 2 - g+(^+i) 2 )vol(5i)) 
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